ABSTRACT



In one exemplary embodiment the invention provides a data
mining system for use in finding clusters of data items in a
database or any other data storage medium. Before the data
evaluation begins a choice is made of the number M of
models to be explored, and of the number K of clusters
within each of the M models. The clusters are used
in categorizing the data in the database into K different
clusters within each model. An initial set of estimates for a
data distribution of each model to be explored is provided.
Then a portion of the data in the database is read from a
storage medium and brought into a rapid access memory 
buffer whose size is determined by the user or operating 
system depending on available memory resources. Data 
contained in the data buffer is used to update the original 
model data distributions in each of the K clusters over all M 
models. Some of the data belonging to a cluster is summa- 
rized or compressed and stored as a reduced form of the data 
representing sufficient statistics of the data. More data is 
accessed from the database and the models are updated. An 
updated set of parameters for the clusters is determined from 
the summarized data (sufficient statistics) and the newly 
acquired data. Stopping criteria are evaluated to determine if 
further data should be accessed from the database. 

34 Claims, 9 Drawing Sheets

SCALABLE SYSTEM FOR EXPECTATION
MAXIMIZATION CLUSTERING OF LARGE 
DATABASES 

RELATED APPLICATION

The present patent application is a continuation-in-part of
U.S. patent application Ser. No. 09/040,219 to Fayyad et al.
entitled "A Scalable System for Clustering of Large
Databases" which was filed in the United States Patent and
Trademark Office on Mar. 17, 1998. The subject matter of
this co-pending application is incorporated herein by
reference.

FIELD OF THE INVENTION

The present invention concerns database analysis and 
more particularly concerns apparatus and method for clus- 
tering of data into data sets that characterize the data. 

BACKGROUND ART 

Large data sets are now commonly used in most business 
organizations. In fact, so much data has been gathered that 
asking even a simple question about the data has become a 
challenge. The modern information revolution is creating 
huge data stores which, instead of offering increased pro- 
ductivity and new opportunities, are threatening to drown 
the users in a flood of information. Tapping into large 
databases for even simple browsing can result in an explo- 
sion of irrelevant and unimportant facts. Even people who 
do not 'own' large databases face the overload problem 
when accessing databases on the Internet. A large challenge 
now facing the database community is how to sift through 
these databases to find useful information. 

Existing database management systems (DBMS) perform 
the steps of reliably storing data and retrieving the data using 
a data access language, typically SQL. One major use of 
database technology is to help individuals and organizations 
make decisions and generate reports based on the data 
contained in the database. 

An important class of problems in the areas of decision 
support and reporting are clustering (segmentation) prob- 
lems where one is interested in finding groupings (clusters) 
in the data. Clustering has been studied for decades in 
statistics, pattern recognition, machine learning, and many 
other fields of science and engineering. However, imple- 
mentations and applications have historically been limited to 
small data sets with a small number of dimensions. 

Each cluster includes records that are more similar to
members of the same cluster than they are to the rest of
the data. For example, in a marketing application, a com- 
pany may want to decide who to target for an ad campaign 
based on historical data about a set of customers and how 
they responded to previous campaigns. Other examples of 
such problems include: fraud detection, credit approval, 
diagnosis of system problems, diagnosis of manufacturing 
problems, recognition of event signatures, etc. Employing 
analysts (statisticians) to build cluster models is expensive, 
and often not effective for large problems (large data sets 
with large numbers of fields). Even trained scientists can fail 
in the quest for reliable clusters when the problem is 
high-dimensional (i.e. the data has many fields, say more 
than 20). 

A goal of automated analysis of large databases is to
extract useful information such as models or predictors from 
the data stored in the database. One of the primary
operations in data mining is clustering (also known as database
segmentation). Clustering is a necessary step in the mining
of large databases as it represents a means for finding
segments of the data that need to be modeled separately. This
is an especially important consideration for large databases
where a global model of the entire data typically makes no
sense as data represents multiple populations that need to be
modeled separately. Random sampling cannot help in
deciding what the clusters are. Finally, clustering is an essential
step if one needs to perform density estimation over the
database (i.e. model the probability distribution governing
the data source).

Applications of clustering are numerous and include the
following broad areas: data mining, data analysis in general,
data visualization, sampling, indexing, prediction, and
compression. Specific applications in data mining include
marketing, fraud detection (in credit cards, banking, and
telecommunications), customer retention and churn
minimization (in all sorts of services including airlines,
telecommunication services, internet services, and web information
services in general), direct marketing on the web and live
marketing in Electronic Commerce.

The framework we present in this invention satisfies the
following Data Mining criteria:

1. Require one scan (or less) of the database if possible:
a single data scan is considered costly; early termination, if
appropriate, is highly desirable.

2. On-line "anytime" behavior: a "best" answer is always
available from the system, with status information on
progress, expected remaining time, etc.

3. Suspendable, stoppable, resumable; incremental
progress saved for resuming a stopped job.

4. Ability to incrementally incorporate additional data
with existing models efficiently.

5. Work within confines of a given limited RAM buffer.

6. Utilize variety of possible scan modes: sequential,
index, and sampling scans if available.

7. Ability to operate with forward-only cursor over a view
of the database. This is necessary since the database view
may be a result of an expensive join query, over a potentially
distributed data warehouse, with much processing required
to construct each row (case).

SUMMARY OF THE INVENTION 

The present invention concerns a process for scaling a
known probabilistic clustering algorithm, the EM algorithm
(Expectation Maximization), to large databases. The method
retains in memory only data points that need to be present in
memory. The majority of the data points are summarized
into a condensed representation that represents their
sufficient statistics. By analyzing a mixture of sufficient statistics
and actual data points, the invention achieves much better
clustering results than random sampling methods and with
dramatically lower memory requirements. The invention
allows the clustering process to terminate before scanning
all the data in the database, hence gaining a major advantage
over other scalable clustering methods that require at a
minimum a full data scan.
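
The reason summarized data can stand in for the records it replaces is that sufficient statistics are additive. The following is a minimal sketch of that property only, not the patented procedure. It assumes unweighted records and uses the field names SUM, SUMSQ and M that appear in the data structures of FIGS. 6A-6D; the class and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class SuffStats:
    """Condensed stand-in for a set of records (hypothetical helper type)."""
    sum: List[float]    # per-dimension sum of the summarized records
    sumsq: List[float]  # per-dimension sum of squares
    m: float            # number of records summarized


def summarize(points: Sequence[Sequence[float]]) -> SuffStats:
    """Compress a batch of records; the raw records may then be dropped."""
    n = len(points[0])
    s, q = [0.0] * n, [0.0] * n
    for p in points:
        for i, v in enumerate(p):
            s[i] += v
            q[i] += v * v
    return SuffStats(s, q, float(len(points)))


def merge(a: SuffStats, b: SuffStats) -> SuffStats:
    """Combine two summaries without revisiting any record."""
    return SuffStats(
        [x + y for x, y in zip(a.sum, b.sum)],
        [x + y for x, y in zip(a.sumsq, b.sumsq)],
        a.m + b.m,
    )
```

Summarizing two batches separately and merging the results gives the same statistics as summarizing their concatenation, which is what lets a later portion of the database be folded in after the earlier records have left memory.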

The technique embodied in our invention relies on the
observation that the EM algorithm applied to a mixture of
Gaussians does not need to rescan all the data items as it is
originally defined and as implemented in popular literature
and statistical libraries and analysis packages. The method
can be viewed as an intelligent sampling scheme that
employs some theoretically justified criteria for deciding
which data can be summarized and represented by a
significantly compressed set of sufficient statistics, and which
data items must be maintained as individual data records in
RAM, and hence occupy a valuable resource. On any given
iteration of the algorithm, the sampled data is partitioned
into three subsets: a discard set (DS), a compression set
(CS), and a retained set (RS). For the first two sets, data is
discarded but representative sufficient statistics are kept that
summarize the subsets. The retained data set RS is kept in
memory.

In accordance with an exemplary embodiment of the
invention, data clustering is performed by characterizing an
initial set of data clusters with a data distribution and
obtaining a portion of data from a database stored on a
storage medium. This data is then assigned to each of the
plurality of data clusters with a weighting factor. The
weighting factor assigned to a given data record is used to
update the characterization of the data clusters. Some of the
data that was accessed from the database is then summarized
or compressed based upon a specified criteria to produce
sufficient statistics for the data satisfying the specified
criteria. The process of gathering data from the database and
characterizing the data clusters based on newly sampled data
from the database takes place until a stopping criterion has
been satisfied.

The invention differs from a clustering system that uses a
traditional K-means clustering choice where each data point
is assigned to one and only one cluster. In this so-called EM
process using an expectation maximization process, each
data point is assigned to all of the clusters but it is assigned
with a weighted factor that normalizes the contributions.

These and other objects, advantages and features of the
invention will become understood from a review of an
exemplary embodiment of the invention which is described
in conjunction with the accompanying drawings.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a schematic illustration of a computer system for
use in practicing the present invention;

FIG. 2 is a schematic depiction of a scalable clustering
system;

FIGS. 3A and 3B are schematics of software components
of the invention;

FIG. 4 is a flow diagram of an exemplary embodiment of
the invention;

FIG. 5 is a one-dimensional plot of a data distribution of
two data clusters illustrating how a data point is assigned to
each cluster;

FIGS. 6A-6D are illustrations of data structures utilized
for storing data in accordance with the exemplary
embodiment of the invention;

FIG. 7 is a plot of two data clusters that illustrates a
contribution to those clusters from data points extracted
from a database;

FIGS. 8A-8C is a flow chart of an extended EM
clustering procedure;

FIG. 9 is a two dimensional depiction of data clusters
showing a position of a mean for the data of those clusters;

FIG. 10 is a data structure for a multiple model
embodiment of the present invention; and

FIG. 11 is a depiction of a user interface for monitoring
progress of a scalable clustering practiced in accordance
with the invention.

DETAILED DESCRIPTION OF EXEMPLARY
EMBODIMENT OF THE INVENTION

The present invention has particular utility for use in
characterizing data contained in a database 10 having many
records stored on multiple, possibly distributed storage
devices. Each record has many attributes or fields which for
a representative database might include age, income,
number of children, number of cars owned etc. A goal of the
invention is to characterize clusters of data in the database
10. This task is straightforward for small databases (all data
fits in the memory of a computer for example) having
records that have a small number of fields or attributes. The
task becomes very difficult, however, for large databases
having huge numbers of records with a high dimension of
attributes.

The published literature describes the characterization of
data clusters using the so-called Expectation-Maximization
(EM) analysis procedure. The EM procedure is summarized
in an article entitled "Maximum likelihood from incomplete
data via the EM algorithm", Journal of the Royal Statistical
Society B, vol 39, pp. 1-38 (1977). The EM process
estimates the parameters of a model iteratively, starting from
an initial estimate. Each iteration consists of an Expectation
step, which finds a distribution for unobserved data (the
cluster labels), given the known values for the observed
data.

FIG. 9 is a two dimensional plot showing the position of
data points extracted from a database 10. One can visually
determine that the data in FIG. 9 is lumped or clustered
together. Once a cluster number K is chosen, one can
associate data with a given one of these clusters. If one
chooses a cluster number of '3' for the data of FIG. 9, one
is able to associate a given data item with one of the three
clusters (K=3) of the figure. In a so-called K-means
clustering technique disclosed in the Fayyad et al. patent
application referenced previously, the data points belong or are
assigned to a single cluster. A process for evaluating a
database using the K-means process is described in our
copending United States patent application entitled "A
scalable method for K-means clustering of large Databases"
filed in the United States Patent and Trademark Office on
Mar. 17, 1998, now U.S. Pat. No. 6,012,058, which issued
in 2000 and which is assigned to the assignee of the
present application and is also incorporated herein by
reference.

In an expectation maximization (EM) clustering analysis,
rather than assigning each data point in FIG. 9 to a cluster
and then calculating the mean or average of that cluster, each
data point has a probability or weighting factor for each of
the K clusters that characterize the data. For the EM analysis
used in conjunction with an exemplary embodiment of the
present invention, one associates a Gaussian distribution of
data about the centroid of each of the K clusters in FIG. 9.

Consider the one dimensional depiction shown in FIG. 5
of two Gaussians G1, G2 representing clusters that have
respective centroids or means. The compactness of the data
within a cluster is generally indicated by the shape of the
Gaussian and the average value of the cluster is given by the
mean. Now consider the data point identified along the axis
as the point "X." This data point 'belongs' to both the
clusters identified by the Gaussians G1, G2. This data point
'belongs' to the Gaussian G2 with a weighting factor
proportional to h2 (probability density value) that is given by
the vertical distance from the horizontal axis to the curve
G2. The data point X 'belongs' to the cluster characterized
by the Gaussian G1 with a weighting factor proportional to
h1 given by the vertical distance from the horizontal axis to
the Gaussian G1. We say that the data point X belongs
fractionally to both clusters. The weighting factor of its
membership to G1 is given by h1/(h1+h2+Hrest); similarly
it belongs to G2 with weight h2/(h1+h2+Hrest). Hrest is the
sum of the heights of the curves for all other clusters
(Gaussians). If the height in other clusters is negligible one
can think of a "fraction" of the case belonging to cluster 1
(represented by G1) while the rest belongs to cluster 2
(represented by G2). For example, if h1=0.13 and h2=0.03,
then 0.13/(0.13+0.03), or approximately 0.8, of the case
belongs to cluster 1, while approximately 0.2 of it belongs
to cluster 2. This contrasts with the K-Means model of the
copending Fayyad et al. application which places the case in
exactly one cluster.

In accordance with the present invention, the data from
the database 10 is brought into a memory 22 (FIG. 1) and an
output model is created from the data by a data mining
engine 12 (FIGS. 2, 3A, 3B). In a client/server
implementation an application program 14 acts as the client and the
data mining engine 12 the server. The application is the
recipient of an output model and makes use of that model in
one of a number of possible ways such as fraud detection etc.

The model is characterized by an array of pointers, one
each for the K clusters of the EM model. Each pointer points
to a vector summarizing a mean for each dimension of the
data and a second vector indicating the spread of the data. A
data structure for the representative or exemplary
embodiment of a clustering model produced by the present
invention is depicted in FIG. 6D. As the EM model is calculated,
some of the recently acquired data that was used to
determine the model is compressed. All the data used to model the
database is then stored in one of three data subsets. A
retained data set (RS) is kept in memory 22 for further use
in performing the EM analysis. A discarded data set (DS)
and a compressed data set (CS) are summarized in the form
of sufficient statistics. The sufficient statistics are retained in
memory.

Overview of Scalable EM

FIG. 4 is a flow chart of the process steps performed
during a scalable EM analysis of data. An initialization step
100 sets up a number of data structures shown in FIGS.
6A-6D and begins with a cluster number K for
characterizing the data.

A next step 110 is to sample a portion of the data in the
database 10 from a storage medium to bring that portion of
data within a rapid access memory (into RAM for example,
although other forms of rapid access memory are
contemplated) of the computer 20 schematically depicted in
FIG. 1. In general, the data has a large number of fields so
that instead of a single dimension analysis, the invention
characterizes a large number of vectors where the dimension
of the vector is the number of attributes of the data records
in the database. The data structure 180 for this data is shown
in FIG. 6C to include a number r of records having a
potentially large number of attributes.

The gathering of data can be performed using either a
sequential scan that uses only a forward pointer to
sequentially traverse the data or an indexed scan that provides a
random sampling of data from the database. It is preferable
in the index scan that data not be accessed multiple times.
This requires a sampling without duplication mechanism for
marking data records to avoid duplicates, or a random index
generator that does not repeat. In particular, it is most
preferable that the first iteration of sampling data be done
randomly. If it is known the data is random within the
database then sequential scanning is acceptable and will
result in best performance as the scan cost is minimized. If
it is not known that the data is randomly distributed, then
random sampling is needed to avoid an inaccurate
representation of the database.

A processor unit 21 of the computer 20 next performs 120
an extended EM analysis of the data in memory. The term
'extended' is used to distinguish the disclosed process from
a prior art EM clustering analysis. Classical (prior art) EM
clustering operates on data records. The disclosed extended
implementation works over a mix of data records and
sufficient statistics representing sets of data records. The
processor 21 evaluates the data brought into memory and
iteratively determines a model of that data for each of the K
clusters. A data structure 122 for the results or output model
of the extended EM analysis is depicted in FIG. 6D.

On a next step of the FIG. 4 flowchart some of the
data used in the present iteration to characterize the K
clusters is summarized and compressed. This summarization
is contained in the data structures of FIGS. 6A and 6B which
take significantly less storage in memory 25 than the
vector data structure needed to store individual records.
Storing a summarization of the data in the data structures of
FIGS. 6B and 6C frees up more memory allowing additional
data to be sampled from the database 10. Additional
iterations of the extended EM analysis are performed on this
data.

Before looping back to get more data the processor 21
determines 140 whether a stopping criterion has been reached.
One stopping criterion is whether the
analysis is good enough by a standard that is described
below. A second stopping criterion has been
reached if all the data in the database has been used in the
EM analysis. One important aspect of the invention is the
fact that instead of stopping the analysis, the analysis can be
suspended. Data in the data structures of FIGS. 6A-6D can
be saved (either in memory or to disk) and the extended EM
process can then be resumed later. This allows the database
to be updated and the analysis resumed to update the EM
analysis without starting from the beginning. It also allows
another process to take control of the processor 21 without
losing the state of the EM analysis. Such a suspension could
be initiated in response to a user request that the
analysis be suspended by means of a user actuated control on
an interface presented to the user on a monitor 47 while the
EM analysis is being performed.

FIGS. 10-14 illustrate user interface screens that are
depicted on a monitor 47 as data is clustered. These screens
are illustrated in the example of the clustering framework
described in this invention applied to scaling the K-means
algorithm in particular. Scaling other clustering algorithms
involves displaying potentially other relevant information
concerning the model being constructed. Of course, this
affects only display of quantities pertaining to a specific
model. General notions such as progress bar (302),
information like 304, and buffer utilization 334, and 332 are
independent of clustering algorithm and do not change with
change in clustering algorithm. Turning to FIG. 9, this
screen 300 illustrates a clustering process as that clustering
takes place. A progress bar 302 indicates what portion of the
entire database has been clustered and a text box 304 above
the progress bar 302 indicates how many records have been
evaluated. In a center portion of the screen 300 two graphs
310, 312 illustrate clustering parameters as a function of
iteration number and cluster ID respectively. The first graph
310 illustrates progress of the clustering in terms of iteration
number which is displayed in the text box 314. The iteration
number refers to the number of data gathering steps that
have occurred since clustering was begun. In the FIG. 9
depiction an energy value for the clustering is calculated as
defined in Appendix D method 2. As the clustering continues
the energy decreases until a stopping criterion has been
satisfied. In the graph 310 of FIG. 9 sixteen iterations are
depicted.
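
Returning to the flow of FIG. 4 (initialize, sample a portion, update the model, summarize, test for stopping), the loop can be sketched as follows for one-dimensional data. This is a deliberately simplified illustration and not the extended EM procedure of the disclosure. It assumes that every buffered record is folded into per-cluster sufficient statistics after each pass (so no retained set is kept), that the stopping test is a small shift in the means, and that every cluster receives some weight. All function and parameter names are hypothetical.

```python
import math


def normalize(heights):
    """Turn per-cluster curve heights into fractional memberships summing to 1."""
    total = sum(heights)
    return [h / total for h in heights]


def height(x, mean, var):
    """Gaussian density at x: vertical distance from the axis to the curve."""
    return math.exp(-((x - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def scalable_em_1d(chunks, means, variances, tol=1e-3):
    """Chunked EM sketch; returns (means, variances, chunks_read)."""
    k = len(means)
    # Per-cluster sufficient statistics: weighted SUM, SUMSQ and M.
    sums, sumsqs, ms = [0.0] * k, [0.0] * k, [0.0] * k
    chunks_read = 0
    for chunk in chunks:                      # bring the next portion into the buffer
        chunks_read += 1
        for x in chunk:                       # weight the record against every cluster
            weights = normalize(
                [height(x, means[j], variances[j]) for j in range(k)]
            )
            for j, w in enumerate(weights):
                sums[j] += w * x
                sumsqs[j] += w * x * x
                ms[j] += w
        # Assumption: the whole buffer is discarded here; only statistics survive.
        new_means = [sums[j] / ms[j] for j in range(k)]
        new_vars = [
            max(sumsqs[j] / ms[j] - new_means[j] ** 2, 1e-6) for j in range(k)
        ]
        shift = max(abs(a - b) for a, b in zip(new_means, means))
        means, variances = new_means, new_vars
        if shift < tol:                       # model has settled: stop before a full scan
            break
    return means, variances, chunks_read
```

With heights 0.13 and 0.03, `normalize` returns 0.8125 and 0.1875, which is the roughly 0.8 and 0.2 split of the worked example. On well separated data the loop can stop after only a few portions have been read, which is the early termination property the disclosure emphasizes.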
The second graph 312 at the bottom of the screen is a
graph of clustering parameters as a function of cluster
number. In the depiction shown there are ten clusters (shown
in the text box 316) and the minimum covariance for each
of these ten clusters is shown. Covariance is defined from
the model data (FIG. 6D) for a given cluster and a given
dimension by the relation:

SumSq/M - Sum*Sum/M^2

A plot of minimum covariance is therefore a plot of the
dimension (1...n) for a given cluster model having the least
or minimum covariance. A drop down list box 320 allows
the user to select other indications of covariance. By
selecting a maximum for this parameter, a depiction of the
dimension of the model having the maximum covariance
(FIG. 10) for each of the ten clusters is shown in the bar
graph 312. An average covariance bar graph (FIG. 12)
indicates the average of the covariance for each cluster over
all cluster dimensions. A different user selection via the drop
down list box 320 (FIG. 13) shows the weight M for the ten
clusters. In a similar manner, a dropdown list box 322 allows
different cluster parameters such as model difference
(Appendix D, method 1) to be plotted on the graph 310.

A row 326 of command buttons at the bottom of the
screen allow a user to control a clustering process. A
parameter screen button 330 allows the user to view a
variety of clustering parameters on a parameter screen (not
shown). By accessing this screen the user can determine for
example a maximum number of records or tuples that can be
brought into memory to the data set RS in a given iteration.
As an example, the user could indicate that this maximum
value is 10,000 records.

As outlined above, as the clustering process is performed
data is summarized in DS, CS, and stored in RS. If a number
of 10,000 records is chosen as the maximum, the system
limits the number of new data that can be read based upon
the number of subclusters in the data set CS. Designate the
number as ROWSMAX, then the amount of data records
that can be currently stored in RS (RSCurrent) is
ROWSMAX-2*c where c is the number of subclusters in
CS. A progress bar 332 indicates a proportion of data buffers
that are currently being used to store RS and CS datasets.
This is also depicted as a percentage in a text box 334.

Other parameters that can be modified on the parameter
screen are choice of the stopping tolerance, choice of the
stopping procedure, choice of parameters for combining
subclusters and adding data points to subclusters, and choice
of the compression procedure used to determine DS data set
candidates. The parameter screen also allows the user to
define where to store the model upon completion of the
clustering. If a process is suspended and the model is stored,
the user can also use this screen to browse the computer disk
storage for different previously stored models.

Current data regarding the dataset CS is depicted in a
panel 340 of the screen. Text boxes 342, 344, 346, 348 in this
panel indicate a number of subclusters c, and average,
minimum and maximum variances for the subclusters using
the above definition of variance. A last text box 350 indicates
the average size of the subclusters in terms of data points or
tuples in the subclusters.

Additional command buttons allow the user to interact
with the clustering as it occurs. A stop button 360 stops the
clustering and stores the results to disk. A continue button
362 allows the process to be suspended and resumed by
activating a resume button 364. A generate batch button 366
allows the user to generate a clustering batch file which can
be executed as a separate process. Finally a close button 368
closes this window without stopping the clustering.

Data Structures

Data structures used during performance of the extended
EM analysis are found in FIGS. 6A-6D. An output or result
of the EM analysis is a data structure 122 designated
MODEL which includes an array 152 of pointers to a first
vector 154 of n elements (floats) designated 'SUM', a
second vector 156 of n elements (floats) designated
'SUMSQ', and a single floating point number 158
designated 'M'. The number M represents the number of database
records represented by a given cluster. The model includes
K entries, one for each cluster.

The vector 'SUM' represents the sum of the weighted
contribution of each database record that has been read in
from the database. As an example a typical record will have
a value of the ith dimension which contributes to each of the
K clusters. Therefore the ith dimension of that record
contributes a weighted component to each of the K sum
vectors. A second vector 'SUMSQ' is the sum of the squared
components of each record which corresponds to the
diagonal elements of the so-called covariance matrix. In a general
case the covariance matrix could be a full nxn matrix. It is
assumed for the disclosed exemplary embodiment that the
off diagonal elements are zero. However, this assumption is
not necessary and a full covariance matrix may be used. A
third component of the model is a floating point number 'M'.
The number 'M' is determined by totaling the weighting
factor for a given cluster k over all data points and dividing
by the number of points. These records constitute the model
output from the EM process and give a value somewhat
akin to the mean (center) and covariance (spread) of the data
in a cluster K.

An additional data structure designated DS in FIG. 6A
includes an array of pointers 160 that point to a group of K
vectors, each of n elements 162, designated
'SUM', a second group of K vectors 164 designated
'SUMSQ', and a group of K floats designated M. This
data structure is similar to the data structure of FIG. 6D that
describes the MODEL. It contains sufficient statistics for a
number of data records that have been compressed into the
data structure shown rather than maintained in memory.
Compression of the data into this data structure and the CS
data structure described below frees up memory for
accessing other data from the database at the step 110 on a next
subsequent iteration of the FIG. 4 scalable EM process.

A further data structure designated CS in FIG. 6B is an
array of c pointers where each pointer points to an element
which consists of a vector of n elements (floats) designated
'SUM', a vector of n elements (floats) designated 'SUMSQ',
and a scalar 'M'. The data structure CS also represents
multiple data points into a vector similar to the MODEL data
structure.

A data structure designated RS (FIG. 6C) is a group of r
vectors having n dimensions. Each of the vectors has n
elements (floats) representing a singleton data point of the
type SDATA. As data is read in from the database at the step
110 it is initially stored in the set RS since this data is not
associated with any cluster. An exemplary implementation of
RS is a vector of pointers to
elements of the type SDATA of the same length as the SUM
vector of the other data structures, and a SUMSQ vector
[illegible].

Table 1 is a list of ten SDATA records which constitute
sample data from a database 10 and that are stored as
individual vectors in the data structure RS.



TABLE 1

CaseID   AGE   INCOME   CHILDREN   CARS
1        30    40       2          2
2        26    21       0          1
3        18    16       0          1
4        57    71       3          2
5        41    73       2          3
6        67    82       6          3
7        75    62       4          1
8        21    23       1          1
9        45    51       3          2
10       28    19       0          0

TABLE 2

                     SUM
Cluster #   AGE   INCOME   CHILDREN   CARS
1           55    50       2.5        2
2           30    38       1.5        2
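
As a rough illustration of how records such as those of Table 1 relate to the SUM, SUMSQ and M entries of the data structures, the sketch below folds the ten records into a single set of statistics and recovers a mean and a per-dimension covariance with the relation SumSq/M - Sum*Sum/M^2. It assumes each record contributes with weight 1 to one cluster, whereas the disclosed model uses weighted contributions across K clusters; the function names are hypothetical.

```python
# Table 1 records as (AGE, INCOME, CHILDREN, CARS), held as singleton vectors.
RS = [
    (30, 40, 2, 2), (26, 21, 0, 1), (18, 16, 0, 1), (57, 71, 3, 2),
    (41, 73, 2, 3), (67, 82, 6, 3), (75, 62, 4, 1), (21, 23, 1, 1),
    (45, 51, 3, 2), (28, 19, 0, 0),
]


def to_stats(records):
    """Collapse singleton records into SUM, SUMSQ and M (unit weights assumed)."""
    n = len(records[0])
    total = [sum(r[i] for r in records) for i in range(n)]
    total_sq = [sum(r[i] * r[i] for r in records) for i in range(n)]
    return total, total_sq, float(len(records))


def mean_and_diag_cov(total, total_sq, m):
    """Mean and diagonal covariance: SumSq/M - Sum*Sum/M^2 per dimension."""
    mean = [s / m for s in total]
    cov = [q / m - s * s / (m * m) for s, q in zip(total, total_sq)]
    return mean, cov


SUM, SUMSQ, M = to_stats(RS)
MEAN, COV = mean_and_diag_cov(SUM, SUMSQ, M)
```

For these ten records SUM is [408, 458, 21, 16] and M is 10, so the AGE mean is 40.8. Three arrays of this kind replace the ten stored vectors, which is the memory saving that the DS and CS structures rely on.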



Extended EM Procedure of FIGS. 8A and 8B 

An extended EM procedure 120 (FIGS. 8A and 8B) lakes 
the contents of the three data structures RS, DS, CS and data 
from an existing model stored in the data structure of FIG. 
6D and produces a new model, llie new model is then stored 
in place of the old model. 

The procedure 120 uses the existing model to create 202
an Old_Model in a data structure like that of FIG. 6D, then
determines 204 the length of the pointer arrays of FIGS.
6A-6C and computes 206 means and covariance matrices
from the Old_Model SUM, SUMSQ and M data. The set of
Old_Model means and covariance matrices are stored as a
list of length K where each element of the list includes two
parts:

1) a vector of length n (called the "Mean") which stores the
mean of the corresponding Gaussian or cluster

2) a matrix of size n×n (called the "CVMatrix") which stores
the values of a covariance matrix of the corresponding
Gaussian or cluster.

The structure holding the means and covariance matrices is
referred to below as "Old_SuffStats".

To compute the matrix CVMatrix for a given cluster from
the sufficient statistics SUM, SUMSQ and M (in FIG. 6D),
the extended EM procedure computes an outer product
defined for 2 vectors, OUTERPROD(vector1, vector2). The
OUTERPROD operation takes 2 vectors of length n and
returns their outer product, or the n×n matrix with the entry
in row h and column j being vector1(h)*vector2(j). A
DETERMINANT function computes the determinant of a
matrix. The procedure also uses a function, INVERSE,
that computes the inverse of a matrix. A further function
TRANSPOSE returns the transpose of a vector (i.e. changes
a column vector to a row vector). The function EXP(z)
computes the exponential e^z.

A function 'ConvertSuffStats' calculates 206 the mean
and covariance matrix from the sufficient statistics stored in
a cluster model (SUM, SUMSQ, M):

[Mean,CVMatrix]=ConvertSuffStats(SUM,SUMSQ,M)
Mean=(1/M)*SUM;
MSq=M*M;
OutProd=OUTERPROD(SUM,SUM);
CVMatrix=(1/MSq)*(M*SUMSQ-OutProd);

The data structures of FIGS. 6A-6D are initialized 100
before entering the FIG. 4 processing loop. In order for the
extended EM procedure 120 to process a first set of data read
into the memory, the MODEL data structure of FIG. 6D that
is copied into Old_Model must not be null. An initial set of
cluster means is presented to the process. One procedure is to
randomly choose the means, place them in the vector
'SUM' and set M=1.0. For a clustering number K=2 for
the data format from Table 1, assume the SUM vector is
given as Table 2 for these two clusters.

Also, arbitrary values are chosen as diagonal members of
the starting model's covariance matrices. Each cluster has a
covariance matrix associated with it. The diagonal values of
the starting matrices are chosen to range in size from 0.8 to
1.2 and are needed to start the process. Initially the float M
of each cluster is set to 1.0. For a vector having dimension
n=4, the covariance matrix is a 4×4 matrix. Assume cluster
number 1 is assigned diagonal entries of the covariance
matrix A (below) yielding a determinant of 0.9504:

$$
A = \begin{pmatrix}
0.8 & 0.0 & 0.0 & 0.0 \\
0.0 & 1.2 & 0.0 & 0.0 \\
0.0 & 0.0 & 0.9 & 0.0 \\
0.0 & 0.0 & 0.0 & 1.1
\end{pmatrix}
\qquad
A^{-1} = \begin{pmatrix}
1.25 & 0.0 & 0.0 & 0.0 \\
0.0 & 0.83 & 0.0 & 0.0 \\
0.0 & 0.0 & 1.1 & 0.0 \\
0.0 & 0.0 & 0.0 & 0.9
\end{pmatrix}
$$

Note the off diagonal elements are assumed to be zero.
This facilitates the process of determining the inverse
matrix A^{-1} as well as the determinant. In a similar fashion
the matrix for cluster 2 is designated as B and has a
determinant of 0.8479:

$$
B = \begin{pmatrix}
1.0 & 0.0 & 0.0 & 0.0 \\
0.0 & 0.95 & 0.0 & 0.0 \\
0.0 & 0.0 & 0.85 & 0.0 \\
0.0 & 0.0 & 0.0 & 1.05
\end{pmatrix}
\qquad
B^{-1} = \begin{pmatrix}
1.0 & 0.0 & 0.0 & 0.0 \\
0.0 & 1.05 & 0.0 & 0.0 \\
0.0 & 0.0 & 1.18 & 0.0 \\
0.0 & 0.0 & 0.0 & 0.95
\end{pmatrix}
$$
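
The conversion from sufficient statistics to a mean and covariance matrix, and the determinant and inverse of a diagonal starting matrix, can be illustrated with a minimal Python sketch. It assumes SUMSQ holds the full n×n sum of outer products of the data; the helper names `outer_product` and `convert_suff_stats` and the three sample points are illustrative only and do not come from the specification.

```python
import math


def outer_product(u, v):
    """Return the matrix whose entry [h][j] is u[h] * v[j]."""
    return [[uh * vj for vj in v] for uh in u]


def convert_suff_stats(sum_vec, sumsq, m):
    """Turn (SUM, SUMSQ, M) into (mean, covariance matrix).

    Assumption: sumsq is the n x n sum of outer products x * x^T.
    """
    n = len(sum_vec)
    mean = [s / m for s in sum_vec]
    out = outer_product(sum_vec, sum_vec)
    cov = [[(m * sumsq[h][j] - out[h][j]) / (m * m) for j in range(n)]
           for h in range(n)]
    return mean, cov


# Illustrative data: three 2-dimensional points, each counted with weight 1.
points = [[1.0, 2.0], [3.0, 6.0], [5.0, 10.0]]
SUM = [sum(p[d] for p in points) for d in range(2)]
SUMSQ = [[sum(p[h] * p[j] for p in points) for j in range(2)]
         for h in range(2)]
mean, cov = convert_suff_stats(SUM, SUMSQ, float(len(points)))

# Diagonal starting matrix A: the determinant is the product of the
# diagonal and the inverse is the element-wise reciprocal.
A_DIAG = [0.8, 1.2, 0.9, 1.1]
det_a = math.prod(A_DIAG)
inv_a = [1.0 / d for d in A_DIAG]
```

Because only SUM, SUMSQ and M are needed, the individual points can be dropped once they have been accumulated, which is the property the compression steps rely on.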



A function designated 'GAUSSIAN' defined below is
used at a step 212 (FIG. 8A) to compute the height of the
Gaussian curve above a given data point, where the Gaussian
has mean=Mean and covariance matrix=CVMatrix.

[height]=GAUSSIAN(x,Mean,CVMatrix)
normalizing_constant=(2*PI)^(n/2)*SQRT(DET(CVMatrix));
CVMatrixInv=INVERSE(CVMatrix);
height=(1/normalizing_constant)*EXP(-(1/2)*(TRANSPOSE(x-Mean))*CVMatrixInv*(x-Mean));

Note, mathematically, the value of GAUSSIAN for a given
cluster for the datapoint x is:

$$
\frac{1}{(2\pi)^{n/2}\sqrt{|\mathrm{CVMatrix}|}}
\exp\left(-\frac{1}{2}\,(x-\mathrm{Mean})^{T}(\mathrm{CVMatrix})^{-1}(x-\mathrm{Mean})\right)
$$
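
As a hedged illustration of GAUSSIAN, the following Python sketch evaluates the height for a diagonal covariance matrix only (a simplification assumed here, matching the diagonal starting matrices; the function name `gaussian_height` is hypothetical). It is applied to the one dimensional 'children' attribute with the cluster means 2.5 and 1.5 and variances 0.9 and 0.85 used in this description.

```python
import math


def gaussian_height(x, mean, cov_diag):
    """Height of a Gaussian with diagonal covariance above the point x."""
    n = len(x)
    det = math.prod(cov_diag)
    normalizing_constant = (2.0 * math.pi) ** (n / 2.0) * math.sqrt(det)
    quad = sum((xi - mi) ** 2 / v for xi, mi, v in zip(x, mean, cov_diag))
    return math.exp(-0.5 * quad) / normalizing_constant


# One dimensional check on the 'children' attribute with value 3.
height_a = gaussian_height([3.0], [2.5], [0.9])    # cluster A
height_b = gaussian_height([3.0], [1.5], [0.85])   # cluster B
weight_a = height_a / (height_a + height_b)

# Four dimensional normalizing factor 1/((2*pi)^2 * sqrt(det A)).
norm_factor_a = 1.0 / ((2.0 * math.pi) ** 2
                       * math.sqrt(math.prod([0.8, 1.2, 0.9, 1.1])))
```

The heights come out near 0.366 and 0.115, and the four dimensional normalizing factor near 0.026, in line with the figures quoted for Case 4 in this description.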



After resetting 208 a New_Model data structure (similar
to FIG. 6D) to all zeros, each point of the RS data structure
is accessed 210 and used to update the New_Model. A
contribution is determined for each of the data points in RS
by determining the weight of each point under the old
model. A weight vector has k elements weight(1),
weight(2), . . . weight(k) where each element indicates the
normalized or fractional assignment of the data point to the
corresponding cluster. Recall that each data record contributes
to all of the K clusters that were set up during initialization.
The mean and covariance matrix structures allow a
height contribution for each record (for j=1, . . . , r) to be
determined at step 212 of the extended EM procedure 120
for each cluster (for l=1, . . . , k). This height contribution is
then scaled to form a weight contribution that takes into
account a factor M_l/N where N is the total number of data
records read thus far from the database.

Normalizing the weight factor is performed at a step 214
of the procedure. This step is performed by a procedure
(referenced below) called UpdateModel(New_Model,
DataPoint, vWeight). At a step 216 an outer product is
calculated for the relevant vector data point in RS. An update
step 218 loops through all k clusters and updates the new
model data structure by adding the contribution for each data
point:



for l = 1, ..., k
    New_Model(l).SUM = New_Model(l).SUM + vWeight(l)*center;
    New_Model(l).SUMSQ = New_Model(l).SUMSQ + vWeight(l)*OuterProd;
    New_Model(l).M_l = New_Model(l).M_l + vWeight(l);
End for
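
A minimal Python sketch of the weighting and update for a single data point follows. The dictionary layout of the model and the names `e_step_weights`, `update_model` and `empty_model` are assumptions made for illustration; the heights 0.3660 and 0.1152 are the example values quoted in this description.

```python
def empty_model(k, n):
    """Create k zeroed clusters, each holding SUM, SUMSQ and M."""
    return [{"SUM": [0.0] * n,
             "SUMSQ": [[0.0] * n for _ in range(n)],
             "M": 0.0} for _ in range(k)]


def e_step_weights(heights, m_values, n_total):
    """Scale each height by M_l/N, then normalize so the weights sum to 1."""
    scaled = [h * m / n_total for h, m in zip(heights, m_values)]
    total = sum(scaled)
    return [s / total for s in scaled]


def update_model(new_model, x, weights):
    """Add the weighted contribution of data point x to every cluster."""
    n = len(x)
    for cluster, w in zip(new_model, weights):
        for h in range(n):
            cluster["SUM"][h] += w * x[h]
            for j in range(n):
                cluster["SUMSQ"][h][j] += w * x[h] * x[j]
        cluster["M"] += w


new_model = empty_model(2, 2)
weights = e_step_weights([0.3660, 0.1152], [1.0, 1.0], 2.0)
update_model(new_model, [3.0, 2.0], weights)
```

Every point contributes to all clusters, so the M values of the clusters increase by exactly one in total for each point processed.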



Consider the ten records of Table 1. The fourth record has
attributes of Age=57, Income=71k, Children=3, and Cars=2.

FIG. 7 is a graph depicting the two clusters for the
'children' attribute for clusters A and B corresponding to the
covariance matrices A and B summarized above. The mean
for cluster B is 1.5 and the standard deviation is 0.9220=
sqrt(0.85). For cluster A the mean is 2.5 and the standard
deviation is 0.9487=sqrt(0.9). Note CaseID 4 has a value of
children=3. This data point has a height of 0.1152 under
Gaussian B and a height of 0.3660 under Gaussian A. The
weighting factor for this point with respect to Gaussian A is
therefore (0.3660)/(0.1152+0.3660)=0.7606. One can
assign 23.94% of CaseID No. 4 to cluster B and 76.06% of
CaseID No. 4 to cluster A. The normalizing factor for
computing the heights is 0.026.

The process of updating the sufficient statistics for the
New_Model continues for all points in the RS data structure.
On the first pass through the procedure, the data
structures DS and CS are null and the RS structure is made
up of data read from the database. Typically a portion of the
main memory of the computer 20 (FIG. 1) is allocated for
the storage of the data records in the RS structure. On a later
iteration of the processing loop of FIG. 4, however, the data
structures DS, CS are not null.

Consider the data points in Table 1 again. Assume that the
two clusters G1 and G2 of FIG. 5 represent two data clusters
after a number of iterations for the age attribute of the Table
1 data. After multiple data gathering steps the means of the
clusters are 39 and 58 yrs respectively.

To free up space for a next iteration of data gathering from
the database, some of the data in the structure RS is
summarized and stored in one of the two data structures CS
or DS (FIGS. 6A, 6B). To define which data points can be
safely summarized or compressed, the invention sets up a
Bonferroni confidence interval (CI) which defines a
multidimensional "box" whose center is the current mean for
the K Gaussians defined in the MODEL (FIG. 6D). In one
dimension this confidence interval is a span of data both
above and below a cluster mean. The confidence interval can
be interpreted in the following way: one is confident, to a
given level, that the mean of a Gaussian will not lie outside
of the CI if it was re-calculated over a different sample of
data. A detailed discussion of the determination of the
Bonferroni confidence interval is found in Appendix A of
this application.

Returning to the data of Table 1, CaseID 4 has an age value
of 57 years and CaseID 5 has an age of 41. These values of
the age attribute fall within a one dimensional confidence
interval of the means and therefore are compressed into the
DS data structures for the two clusters wherein CaseID 4 is
associated or belongs to the cluster G2 and CaseID 5 is
associated or belongs to the cluster G1. In performing this
compression on actual data a vector distance over all
dimensions of the data is calculated and a confidence interval
box is determined.

To identify smaller, "dense" regions in the set of data not
compressed into the DS dataset, the following process is
adopted. Identify many "candidate" subsets of "dense" data
points. Then these "candidates" are filtered to make sure that
they satisfy a specified "density" criterion. Two of these
so-called subclusters SUB1, SUB2 are shown in FIG. 5. Of
the candidates that remain after this filtering procedure, we
merge the two nearest candidates. If the resulting merger still
satisfies the "density" criterion, we keep the merged candidate,
otherwise it is discarded. The candidate set is determined by
running a classic K-means algorithm on the small number of
data points remaining in RS after data points have been
compressed into DS. This K-means procedure determines a
number of subclusters. The subclusters are then kept and
merged if the maximum standard deviation along the
coordinate axes is less than a threshold.

Consider the subcluster SUB2. This subcluster is
characterized by a Gaussian curve that has its own mean and
covariance matrix. The mean appears to be about 67 years
which is the same as the age attribute for CaseID no. 6.
Enough other records fall within the region of this subcluster
to warrant summarization of the data from a multiple
number of records within this subcluster SUB2. Other
records that cannot be classed in one of the subclusters are
maintained as individual records in RS. This might be the
case for CaseID 3 having an age of 18 which is considerably
lower than the mean of cluster G1 and is not in close
proximity to any identified subcluster.

Returning to FIGS. 8A and 8B, the addition of the two
datasets CS and DS means that on subsequent iterations, after
each of the single points has been updated a branch 220 is
taken to begin the process of updating the New_Model data
structure for all the subclusters in CS. A contribution is
determined for each of the data points in CS by determining
230 the weight of each subcluster under the old model. First
a center vector for the subcluster is determined 230 from the
relation center=(1/CS(j).M)*CS(j).SUM.
A weight vector has K elements weight(1), weight(2), . . .
weight(k) where each element indicates the normalized or
fractional assignment of a given subcluster to a cluster. This
weight is determined 232 for each cluster and the weight
factor is normalized at a step 234 of the procedure. This step
is performed by a procedure (referenced later in this
application) called UpdateModel(New_Model, SuffStat,
vWeight). Note this function is the same as the procedure
UpdateModel introduced above for data points, but this one
takes sufficient statistics (CS members and DS members) as
its second argument. At a step 236 an outer product is
calculated for the relevant vector of the subcluster. An update
step 238 for the subcluster of CS loops through all k clusters:

for l = 1, ..., k
    New_Model(l).SUM = New_Model(l).SUM + weight(l)*center*CS(j).M;
    New_Model(l).SUMSQ = New_Model(l).SUMSQ + weight(l)*OuterTempProd*CS(j).M;
    New_Model(l).M_l = New_Model(l).M_l + weight(l)*CS(j).M;
End for
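
The subcluster update can be sketched in Python as follows. The sketch assumes OuterTempProd is the outer product of the subcluster center with itself, and the function name `update_model_from_suffstat`, the dictionary layout and the sample values are hypothetical.

```python
def update_model_from_suffstat(new_model, suff, weights):
    """Add one subcluster, summarized by (SUM, M), to every cluster.

    The subcluster acts like M copies of its center, so each
    contribution is multiplied by suff["M"].
    """
    m = suff["M"]
    n = len(suff["SUM"])
    center = [s / m for s in suff["SUM"]]
    for cluster, w in zip(new_model, weights):
        for h in range(n):
            cluster["SUM"][h] += w * center[h] * m
            for j in range(n):
                cluster["SUMSQ"][h][j] += w * center[h] * center[j] * m
        cluster["M"] += w * m


new_model = [{"SUM": [0.0, 0.0],
              "SUMSQ": [[0.0, 0.0], [0.0, 0.0]],
              "M": 0.0} for _ in range(2)]
subcluster = {"SUM": [12.0, 6.0], "M": 4.0}   # four points, center (3, 1.5)
update_model_from_suffstat(new_model, subcluster, [0.75, 0.25])
```

With weights 0.75 and 0.25 the four summarized points are split three to one between the two clusters, without any of the original records being revisited.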



When the contributions of all the subclusters whose
sufficient statistics are contained in CS have been used to
update the New_Model, a branch 240 is taken to update the
New_Model using the contents of the data structure DS. A
center for each of the k entries of DS is determined 250 from
the relation center=(1/DS(j).M)*DS(j).SUM. A weight of
this 'point' is then determined under the Old_Model and the
weight is normalized 254. The contributions of each of the
subclusters are then added 260 to the sufficient statistics of the
New_Model:

An exemplary embodiment of the procedure UpdateModel
(New_Model, SuffStat, vWeight) to work with DS members
(sufficient stats):

for l = 1, ..., k
    New_Model(l).SUM = New_Model(l).SUM + weight(l)*center*DS(j).M;
    New_Model(l).SUMSQ = New_Model(l).SUMSQ + weight(l)*OuterTempProd*DS(j).M;
    New_Model(l).M_l = New_Model(l).M_l + weight(l)*DS(j).M;
End for



After the New_Model has been updated at the step 260 for
each of the k clusters, the extended EM procedure tests 265
whether a stopping criterion has been met. This test begins
with an initialization of two variables CV_dist=0 and
mean_dist=0. For each cluster a new covariance matrix is
calculated and a distance from the old mean and the new
mean as well as a distance between the new and old
covariance matrices is determined. These values are totaled
for all the clusters:

For l = 1, ..., k
    [New_Mean, New_CVMatrix] = ConvertSuffStats(New_Model(l).SUM,
        New_Model(l).SUMSQ, New_Model(l).M_l);
    mean_dist = mean_dist + distance(Old_SuffStats(l).Mean, New_Mean);
    CV_dist = CV_dist + distance(Old_SuffStats(l).CVMatrix, New_CVMatrix);
End for

The stopping criterion determines whether the sum of these
two numbers is less than a stopping criteria value:

((1/(2*k))*mean_dist+CV_dist)<stop_tol

If the stopping criterion is met the New_Model becomes
the Model and the procedure returns 268. Otherwise the
New_Model becomes the old model and the procedure
branches 270 back to recalculate another New_Model from
the then existing sufficient statistics in RS, DS, and CS.

Three more detailed explanations of alternate procedures
for computing the elements of the data set DS are presented
below. The third is our preferred embodiment:

First Embodiment of DS Data Set Compression

Let Current_Means denote the set {x̄^1, x̄^2, . . . , x̄^k} and
Current_Data denote the set {x^1, x^2, . . . , x^n}. Assume that the
sets of sufficient statistics DS, CS, RS also keep track of the
number of data represented in each. Initially the so called
Compress_Set CS is empty.

For each cluster l=1, . . . , k:

Determine the CI interval [L^l, U^l] on the mean of the l-th
Gaussian, by one of the 2 methods discussed in Appendix A,
for instance.

For each element x^i in the dataset RS, let w^i ∈ R^k be the
vector of probabilistic assignments of data point x^i to the K
Gaussians.

Compute the perturbed centers x̃^1, x̃^2, . . . , x̃^k for perturbed
Gaussians by solving:

min /(i^)log/ ^/(^) =

Let w̃^i ∈ R^k be the vector of probabilistic assignments of
data point x^i to the K perturbed Gaussians.

If ||w^i - w̃^i||<T, then place (x^i, w^i) in Discard_Set. Remove
x^i from Current_Data RS.

Update DS by computing the sufficient statistics from
Discard_Set. DS is then computed by determining sufficient
statistics of a set of probability distributions "best" fitting the
Compress_Set.

Second Embodiment of DS Data Set Compression

In accordance with a second exemplary embodiment of
the present invention a certain percentage of points "nearest"
to the Current_Means of the K Gaussians is summarized by
a set of sufficient statistics. The percentage may be
user-defined or adapted from the current state of the clustering
and dynamically changing memory requirements, for
instance. The process begins by presenting a Mahalanobis
distance and showing how it is related to the probability that
a particular data point was generated by one of the K
Gaussians in the mixture model.

The Mahalanobis distance from a data point x to the mean
(center) of a multivariate Gaussian with mean x̄ and
covariance matrix S is:

$$
D(x,\bar{x}) = (x-\bar{x})^{T} S^{-1} (x-\bar{x})
$$

We note the multivariate data point x and the mean x̄ are
assumed to be column vectors. The superscript 'T' denotes
transpose and the superscript '-1' applied to the matrix S
denotes matrix inversion.

The Mahalanobis distance is related to the value of the
probability density function p(x), assuming the multivariate
Gaussian with mean and covariance matrix specified above:

$$
p(x) = \frac{1}{(2\pi)^{n/2}\sqrt{|S|}}
\exp\left(-\frac{1}{2}\,(x-\bar{x})^{T} S^{-1} (x-\bar{x})\right)
$$

Here |S| denotes the determinant of the covariance matrix S.
The expression above states that the larger the Mahalanobis
distance of a given data point to a cluster center, the less
likely it is that the given cluster generated the data point. A
given percentage of the newly sampled data points,
determined to be most likely as generated by the K clusters, are
compressed. The actual percentage of points to compress,
denoted by the parameter p, may be user-defined or
determined based on the current clustering and memory
limitations.
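
The relation between Mahalanobis distance and density can be checked numerically. The Python sketch below takes the distance to be the quadratic form (x - x̄)^T S^{-1} (x - x̄) and restricts S to a diagonal matrix; both are assumptions of this illustration, as are the function names and the two sample points.

```python
import math


def mahalanobis(x, mean, cov_diag):
    """Quadratic-form distance for a diagonal covariance matrix."""
    return sum((xi - mi) ** 2 / v for xi, mi, v in zip(x, mean, cov_diag))


def density(x, mean, cov_diag):
    """Gaussian density expressed through the Mahalanobis distance."""
    n = len(x)
    norm = (2.0 * math.pi) ** (n / 2.0) * math.sqrt(math.prod(cov_diag))
    return math.exp(-0.5 * mahalanobis(x, mean, cov_diag)) / norm


mean = [55.0, 50.0, 2.5, 2.0]       # cluster 1 row of Table 2
cov_diag = [0.8, 1.2, 0.9, 1.1]     # diagonal of matrix A
near = [55.5, 50.0, 3.0, 2.0]
far = [57.0, 52.0, 3.0, 2.0]
d_near = mahalanobis(near, mean, cov_diag)
d_far = mahalanobis(far, mean, cov_diag)
p_near = density(near, mean, cov_diag)
p_far = density(far, mean, cov_diag)
```

Since the density is a decreasing function of the distance, ranking points by distance and ranking them by likelihood under a single cluster give the same order.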
Let Current_Means denote the set {x̄^1, x̄^2, . . . , x̄^k} and
Current_Data be made up of New_Data, denoted as {x^1,
x^2, . . . , x^n}, and Old_Data (we are only concerned with
compression of New_Data; Old_Data has been compressed
on a prior iteration).

Initially Compress_Set=empty.

For each cluster l=1, . . . , k:

Set New_Data(l) to be the subset of New_Data that is
nearest to current mean x̄^l as measured by the Mahalanobis
distance.

Set DS(l) to be the set of sufficient statistics representing
compression near the current mean x̄^l.

Set CS(l) to be the set of sufficient statistics for all
subclusters nearest the current mean x̄^l as measured by the
Mahalanobis distance.

For each data point x^{l,i} in New_Data(l), set D(x^{l,i}, x̄^l)
to be the Mahalanobis distance from the data point to the
current mean x̄^l.

Set r^l to be a threshold on the Mahalanobis distance over each
data element in Current_Data(l) so that p percent of the
elements of Current_Data(l) have a Mahalanobis distance
less than r^l (this operation is easily done by sorting the list
of D(x^{l,i}, x̄^l) values).

If D(x^{l,i}, x̄^l)<r^l, then add data point x^{l,i} to Discard_Set and
remove it from Current_Data.

Update DS by computing the sufficient statistics of SUM and
SUMSQ (FIG. 6A) from the Compress_Set. In general, DS
is computed by determining sufficient statistics of a set of
probability distributions "best" fitting the Discard_Set.
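
The thresholding step of this embodiment amounts to sorting the distances for one cluster and cutting the list at p percent. A minimal Python sketch, with a hypothetical function name and made-up distances, is:

```python
def split_by_threshold(distances, p_percent):
    """Split point indices into (discard, keep).

    The p percent of points with the smallest distance are selected
    for compression; the remaining indices stay as individual records.
    """
    order = sorted(range(len(distances)), key=lambda i: distances[i])
    count = int(len(distances) * p_percent / 100.0)
    discard = order[:count]
    keep = sorted(order[count:])
    return discard, keep


# Illustrative distances of five new points to one cluster mean.
distances = [0.4, 2.5, 0.1, 7.0, 1.2]
discard, keep = split_by_threshold(distances, 40)
```

Sorting once per cluster keeps the cost of choosing the threshold at O(n log n) in the number of newly sampled points assigned to that cluster.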
Third Embodiment of DS Data Set Compression 

In accordance with a third exemplary embodiment of the
present invention, a certain percentage of points that are
"most likely" under each cluster are moved to the DS set of
that cluster. This embodiment is similar to the second, except
instead of ranking data by Mahalanobis distance, we rank it
by likelihood value. Likelihood is the height of the Gaussian
at the data point. The procedure proceeds as follows:

1. For each data point, determine which cluster it is most 
likely under (which Gaussian has the highest curve at the 
data point). Call this Temporary Membership. 

2. For each cluster, rank its Temporary Members by their 
likelihood values, then move to DS the top P % of the 
rank. P is a value specified by the user. 

Note that step 2 can be equivalently performed by finding 
the nearest P % of the temporary members of a cluster when 
the distance measure is the Mahalanobis Distance intro- 
duced above. 
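
The two steps of the third embodiment can be sketched in Python on the age attribute of Table 1. The cluster means 39 and 58 are the values given for G1 and G2; the common variance of 100, the choice P=50 and the function names are assumptions made only for this illustration.

```python
import math


def height_1d(x, mean, var):
    """Height of a one dimensional Gaussian at x."""
    return (math.exp(-0.5 * (x - mean) ** 2 / var)
            / math.sqrt(2.0 * math.pi * var))


def select_for_ds(points, clusters, p_percent):
    """Return, per cluster, the indices of the points to move to DS."""
    members = {l: [] for l in range(len(clusters))}
    # Step 1: temporary membership in the most likely cluster.
    for idx, x in enumerate(points):
        heights = [height_1d(x, m, v) for m, v in clusters]
        best = max(range(len(clusters)), key=lambda l: heights[l])
        members[best].append((heights[best], idx))
    # Step 2: rank the members by likelihood and take the top P percent.
    selected = {}
    for l, ranked in members.items():
        ranked.sort(reverse=True)
        count = int(len(ranked) * p_percent / 100.0)
        selected[l] = [idx for _, idx in ranked[:count]]
    return selected


ages = [30, 26, 18, 57, 41, 67, 75, 21, 45, 28]   # Table 1, AGE column
clusters = [(39.0, 100.0), (58.0, 100.0)]          # (mean, variance)
selected = select_for_ds(ages, clusters, 50)
```

Under these assumptions the record with age 41 is the first one chosen for the first cluster and the record with age 57 is chosen for the second, while the record with age 18 stays in RS.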

Calculate CS Data Structure 

Assume that the set RS consists of singleton data points
and the points compressed by embodiments 1 and 2 above
have been removed from the RS dataset and have been
summarized in the DS data set. In the one dimensional case
described previously this could include, for example, CaseID
4 and CaseID 5. Let m be the number of singleton data
elements left in RS.

1. Set CS_New=empty.

2. Set k' to be the number of subcluster candidates to search for.

3. Randomly choose k' elements from RS to use as an initial
starting point for classic K-means.

4. Run classic K-means over the data in RS with the initial
point. This procedure will determine k' candidate subclusters.

5. Set up a new data structure CS_New to contain the set of
sufficient statistics for the k' candidate subclusters
determined in this manner.

6. For each set of sufficient statistics in CS_New, if the number
of data points represented by these sufficient statistics is
below a given threshold, remove the set of sufficient
statistics from CS_New and leave the data points generating
these sufficient statistics in RS.

7. For each set of sufficient statistics in CS_New remaining,
if the maximum standard deviation along any dimension of the
corresponding candidate subcluster is greater than a threshold
β, remove the set of sufficient statistics from CS_New and
keep the data points generating these sufficient statistics in RS.

8. Set CS_Temp=CS_New ∪ CS. Augment the set of previously
computed sufficient statistics CS with the new ones
surviving the filtering in steps 6 and 7.

9. For each set of sufficient statistics s (corresponding to a
subcluster) in CS_Temp:

Determine s', the set of sufficient statistics in CS_Temp
corresponding to the nearest subcluster to the subcluster
represented by s.

If the subcluster formed by merging s and s', denoted by
merge(s, s'), is such that the maximum standard deviation
along any dimension is less than β, then add merge(s, s') to
CS_Temp and remove s and s' from CS_Temp.

10. Set CS=CS_Temp. Remove from RS all points that went
into CS (RS=RS-CS).

Note that the function merge(s, s') simply computes the
sufficient statistics for the subcluster summarizing the
points in both s and s'.
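
Because sufficient statistics are additive, merging two subclusters and testing the merged result against the standard deviation threshold needs no access to the original points. The Python sketch below keeps only per-dimension sums of squares (a diagonal simplification assumed here); the function names, the threshold value 2.0 and the sample ages are hypothetical.

```python
import math


def suff_stats(points):
    """Summarize points by per-dimension SUM, SUMSQ and the count M."""
    n = len(points[0])
    return {"SUM": [sum(p[d] for p in points) for d in range(n)],
            "SUMSQ": [sum(p[d] ** 2 for p in points) for d in range(n)],
            "M": float(len(points))}


def merge(s, t):
    """Sufficient statistics of the union of two subclusters."""
    n = len(s["SUM"])
    return {"SUM": [s["SUM"][d] + t["SUM"][d] for d in range(n)],
            "SUMSQ": [s["SUMSQ"][d] + t["SUMSQ"][d] for d in range(n)],
            "M": s["M"] + t["M"]}


def max_std(s):
    """Largest standard deviation along any coordinate axis."""
    variances = [s["SUMSQ"][d] / s["M"] - (s["SUM"][d] / s["M"]) ** 2
                 for d in range(len(s["SUM"]))]
    return math.sqrt(max(0.0, max(variances)))


a = suff_stats([[66.0], [67.0], [68.0]])
b = suff_stats([[67.0], [69.0]])
merged = merge(a, b)
threshold = 2.0                       # illustrative value only
keep_merged = max_std(merged) < threshold
```

Here the merged subcluster has a standard deviation of about 1.02 years, so it passes the illustrative threshold and would replace the two candidates.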

Stopping Criteria at Step 140 

The scalable Expectation Maximization analysis is 
stopped (rather than suspended) and a resultant model output 
produced when the test 140 of FIG. 4 indicates the Model is 
good enough. Two alternate stopping criteria (other than a 
scan of the entire database) are used. 

A first stopping criterion defines a probability function p(x)
to be the quantity

$$
p(x) = \sum_{l=1}^{K} \frac{M(l)}{N}\, g(x \mid l)
$$

where x is a data point or vector sampled from the database
and 1) the quantity M(l) is the scalar weight for the l-th
cluster (the number of data elements from the database
sampled so far represented by cluster l), 2) N is the total
number of data points or vectors sampled thus far, and 3)
g(x|l) is the weighting factor of the data point for the l-th
cluster. This weighting factor is determined from the SUM,
SUMSQ and M(l) data for a given cluster as outlined
previously.

Now define a function f(iter) that changes with each
iteration.

$$
f(\mathrm{iter}) = \frac{1}{N} \sum_{i=1}^{N} \log p(x_i)
$$

The summation in the function is over all data points and
therefore includes the subclusters in the data structure CS,
the summarized data in DS and individual data points in RS.
When the values of p(x_i) are calculated, the probability
function of a subcluster is determined by calculating the
weighting factor in a manner similar to the calculation at
step 232. Similarly the weighting factors for the k elements
of DS are calculated in a manner similar to the step 252 in
FIG. 8B. Consider two computations during two successive
processing loops of the FIG. 4 scalable EM analysis.
Designate the calculations for these two iterations as f_z and
f_{z-1}. Then a difference parameter d_z=f_z-f_{z-1}. Evaluate the
maximum difference parameter over the last r iterations and if no
difference exceeds a stopping tolerance ST then the first
stopping criterion has been satisfied and the model is output.
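
A minimal Python sketch of this first stopping test follows. It assumes f is the average of log p(x) over the points seen so far and compares the absolute value of each difference with the tolerance; the function names and the sample history are illustrative.

```python
import math


def avg_log_likelihood(p_values):
    """f(iter): mean of log p(x) over all points seen so far."""
    return sum(math.log(p) for p in p_values) / len(p_values)


def first_stopping_criterion(f_history, r, stop_tol):
    """True when none of the last r differences exceeds stop_tol."""
    if len(f_history) < r + 1:
        return False
    recent = f_history[-(r + 1):]
    diffs = [abs(recent[i] - recent[i - 1]) for i in range(1, len(recent))]
    return max(diffs) <= stop_tol


f_history = [-12.0, -9.5, -9.1, -9.05, -9.04, -9.038]
stop_now = first_stopping_criterion(f_history, 3, 0.1)
stop_longer_window = first_stopping_criterion(f_history, 4, 0.1)
f_example = avg_log_likelihood([1.0, math.e])
```

Looking back over r iterations rather than one guards against stopping on a single small step while the model is still drifting.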

A second stopping criterion is the same as the stopping
criterion outlined earlier for the extended EM procedure 120.
Each time the Model is updated K cluster means and
covariance matrices are determined. The two variables
CV_dist and mean_dist are initialized. For each cluster k
the newly determined covariance matrix and mean are
compared with a previous iteration for these parameters. A
distance from the old mean and the new mean as well as a
distance between the new and old covariance matrices is
determined. These values are totaled for all the clusters:

For l = 1, ..., k
    [New_Mean, New_CVMatrix] = ConvertSuffStats(New_Model(l).SUM,
        New_Model(l).SUMSQ, New_Model(l).M_l);
    mean_dist = mean_dist + distance(Old_SuffStats(l).Mean, New_Mean);
    CV_dist = CV_dist + distance(Old_SuffStats(l).CVMatrix, New_CVMatrix);
End for

The stopping criterion determines whether the sum of these
two numbers is less than a stopping criteria value:

((1/(2*k))*mean_dist+CV_dist)<stop_tol
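
The model-difference test can be sketched in Python as shown below. The specification does not define the `distance` function, so Euclidean distance between means and Frobenius distance between covariance matrices are assumed here; all names and sample values are illustrative.

```python
import math


def vector_distance(a, b):
    """Euclidean distance between two mean vectors (an assumption)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def matrix_distance(a, b):
    """Frobenius distance between two covariance matrices (an assumption)."""
    return math.sqrt(sum((x - y) ** 2
                         for row_a, row_b in zip(a, b)
                         for x, y in zip(row_a, row_b)))


def model_converged(old_stats, new_stats, stop_tol):
    """Apply ((1/(2*k))*mean_dist + CV_dist) < stop_tol."""
    k = len(old_stats)
    mean_dist = 0.0
    cv_dist = 0.0
    for (old_mean, old_cv), (new_mean, new_cv) in zip(old_stats, new_stats):
        mean_dist += vector_distance(old_mean, new_mean)
        cv_dist += matrix_distance(old_cv, new_cv)
    return (1.0 / (2 * k)) * mean_dist + cv_dist < stop_tol


identity = [[1.0, 0.0], [0.0, 1.0]]
old = [([0.0, 0.0], identity), ([3.0, 4.0], identity)]
small_move = [([0.0, 0.0], identity), ([3.0, 4.002], identity)]
large_move = [([0.0, 0.0], identity), ([6.0, 8.0], identity)]
converged_small = model_converged(old, small_move, 0.01)
converged_large = model_converged(old, large_move, 0.01)
```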

Multiple Model Embodiment 

In accordance with an alternate embodiment of the
present invention, the process of FIG. 4 is supplemented
with a model optimizer. In accordance with this
embodiment, and as depicted in FIG. 2, a multiple number
of different clustering models S are simultaneously
generated by the computer 20. In FIG. 10 one sees that the
multiple model embodiment is implemented with an array S
of pointers m_1 . . . m_s where each pointer points to a different
model data structure. The model data structure is depicted in
FIG. 6D. In this embodiment the structures CS and RS are
shared by the multiple models.

Each of the models m_j is initialized with a different set of
centroid vectors (value of 'SUM', M=1) for the K different
clusters of the model. When data is gathered at the step 110,
that data is used to update each of the S models. An extended
EM procedure for the multiple model process takes into
account the multiple model aspects of the structures DS and
CS and is performed on each of the S models. On a first
iteration through the FIG. 4 process there is no DS or CS
dataset for any of the models so that all data is in the RS
dataset. A given data point r_i in the data set RS is compressed
into the dataset DS_j for only one of the S models even though
it may be close enough to a centroid to compress into a DS
dataset for multiple models. The data point r_i is assigned to
the set DS of the model whose centroid is closest to the point
r_i.

DS structures for all the S models are determined by
compressing data points into the appropriate DS data
structure. The CS data structures for each of the models are then
determined from the points remaining in RS. When
performing the extended EM procedure 120, however, the CS
sufficient statistics must be augmented with the sufficient
statistics contained in the DS data structures of the other
models. When performing the extended EM procedure to
update a given model m_j, the subclusters in CS must be
augmented with the DS structures from the other models.
Specifically, when updating model m_j, the extended EM
procedure considers the augmented set CS_j=CS ∪ DS_1 ∪
DS_2 . . . DS_{j-1} ∪ DS_{j+1} ∪ . . . DS_S when performing the loop
230 of FIG. 7. If a data point is compressed into DS, it enters
the DS set of only one model at the step 240, hence there is
no double counting of data.
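
The two bookkeeping rules of the multiple model embodiment, assigning a compressed point to exactly one model and augmenting CS with the DS sets of the other models, can be sketched in Python as follows. The squared Euclidean distance, the function names and the sample values are assumptions of this illustration.

```python
def closest_model(point, centroids_per_model):
    """Index of the model owning the centroid nearest to the point."""
    best_model = None
    best_dist = float("inf")
    for model_idx, centroids in enumerate(centroids_per_model):
        for c in centroids:
            d = sum((p - q) ** 2 for p, q in zip(point, c))
            if d < best_dist:
                best_model, best_dist = model_idx, d
    return best_model


def augmented_cs(cs, ds_per_model, j):
    """CS plus the DS entries of every model except model j."""
    out = list(cs)
    for idx, ds in enumerate(ds_per_model):
        if idx != j:
            out.extend(ds)
    return out


centroids_per_model = [[[39.0], [58.0]],    # model 0
                       [[30.0], [70.0]]]    # model 1
owner = closest_model([68.0], centroids_per_model)

cs = ["sub1"]
ds_per_model = [["ds0_a", "ds0_b"], ["ds1_a"]]
cs_for_model_0 = augmented_cs(cs, ds_per_model, 0)
```

Each summarized point appears in exactly one DS set, and every model sees it either through its own DS or through the augmented CS, which is why no data is counted twice.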

The multiple model analysis can be performed until one 
of the models satisfies the stopping criteria at the step 140. 
An alternate system would continue to compute all the 
models until each model reaches a stopping criteria. 
Additionally, the scalable EM process could be performed 
until a certain percentage of the models have reached a 
stopping criteria. The multiple model implementation shares 
data structures between models and performs calculations on 
certain data unique to a given model. This analysis is 
susceptible to parallel processing on a computer 20 having 
multiple processing units 21. 

Pseudo-Code for Multiple Model Embodiment 

The following presents pseudo-code for an exemplary
embodiment for the multiple-model, scalable clustering. The
previous discussion makes it clear how the various
embodiments can be implemented by modifying the pseudo-code
below.



MAIN CLUSTERING LOOP:
Sub GeneralizedCluster (MODELs, ReinitEmptySeeds, StopTol, StdTol)
    DSs = {}
    Open (DataSource)
    Do {
        OldMODELs = MODELs
        SEM (MODELs, DSs, CS, RS, ReinitEmptySeeds, StopTol, StdTol)
        If TotalRowsRead < Length(DataSource) Then
            EmThreshold (MODELs, DSs, RS, Confidence)
            UpdateCS (CS, RS, StdTol)
        End If
    } While Not StopCriteria (OldMODELs, MODELs, StopTol)
    Close (DataSource)
    For Each Model In MODELs Do
        Save (Model)
    Next Model
End Sub

The algorithm GeneralizedCluster above calls two subroutines: SEM() and
EmThreshold(). These are defined below. These two algorithms also reference other
routines which are listed here:

[vWeight,S] = ComputeEMWeights(Model,Mean)
[void] = UpdateModel(Model,SuffStat,vWeight)
[void] = UpdateModel(Model,DataPoint,vWeight)
[void] = UpdateSeedCache(SeedCache,Mean,RawScore)
[SuffStat] = GetBestCandidate(SeedCache)
[void] = UpdateCS(...)

The UpdateCS() routine is described in copending application Ser. No. 09/040,219 to
Fayyad et al., which is incorporated herein by reference.

The UpdateModel() functions only differ in the second argument and have been
explained previously. The procedure UpdateSeedCache() maintains a cache of candidate
seeds for new clusters if the main loop at some point decides to reset one of the clusters to
a new starting point (e.g. cluster goes empty). It manages a list of candidate points.

The procedure GetBestCandidate(SeedCache) returns the next candidate from the
SeedCache. This criterion can be implemented in a variety of ways, but in the exemplary
embodiment it returns the point that is assigned the lowest likelihood given the existing
model (set of clusters). That is the point in SeedCache that has the lowest sum of heights
of Gaussians from this model. This point represents a point that fits least well to the
current model.
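
The selection rule described for GetBestCandidate, returning the cached point with the lowest sum of Gaussian heights, can be illustrated with a one dimensional Python sketch. The cluster means 39 and 58 are taken from the age example; the variance of 100, the cached values and the function names are assumptions.

```python
import math


def height_1d(x, mean, var):
    """Height of a one dimensional Gaussian at x."""
    return (math.exp(-0.5 * (x - mean) ** 2 / var)
            / math.sqrt(2.0 * math.pi * var))


def get_best_candidate(seed_cache, clusters):
    """Return the cached point that fits the current model least well."""
    return min(seed_cache,
               key=lambda x: sum(height_1d(x, m, v) for m, v in clusters))


clusters = [(39.0, 100.0), (58.0, 100.0)]   # (mean, variance) per cluster
seed_cache = [40.0, 57.0, 18.0, 75.0]
candidate = get_best_candidate(seed_cache, clusters)
```

Seeding an empty cluster at the worst-fitting point places the new cluster where the current model explains the data least.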

Having described all the references procedures, we now list the main algorithms: 
Sub SEM (MODELs,DSs,CS,RS,ReinitEmptySeeds,StopTol,StdTol)
  For ModelIndex = 1 To Length(MODELs) Do
DoModelAgain:
    M = MODELs[ModelIndex]
    RestartModel = False
    NewSeedCache = {}
    For Iteration = 0 to INFINITY Do
      For each X in RS Do
        [vWeight,s] = ComputeEMWeights(M,X.mean);
        UpdateModel (Mnew,X,vWeight);
        If ReinitEmptySeeds Then
          UpdateSeedCache (NewSeedCache,X,vWeight,s);
        End If
      Next X

      For each X in CS Do
        [vWeight,s] = ComputeEMWeights(M,X.mean);
        UpdateModel (Mnew,X,vWeight);
        If ReinitEmptySeeds Then
          UpdateSeedCache (NewSeedCache,X,vWeight,s);
        End If
      Next X

      For each DS in DSs Do
        For each X in DS Do
          [vWeight,s] = ComputeEMWeights(M,X.mean);
          UpdateModel (Mnew,X,vWeight);
        Next X
      Next DS

      If ModelDiff(Mnew,M) < StopTol Then
        Break;
      End If

      If ReinitEmptySeeds Then
        For each Cluster in M Do
          If Cluster Is Empty Then
            ResetModel = True
          End If
        Next Cluster
        If ResetModel Then
          For each Cluster in M Do
            If Cluster Is Empty Then
              Candidate = GetBestCandidate(NewSeedCache)
              Cluster.Mean = Candidate.Mean
              Candidate.Remove
            End If
            Cluster.Covariance = {1,1,...}
          Next Cluster
        End If
      End If
      M = Mnew
    Next Iteration
    M = Mnew

    // at this point, M has converged but it may contain centroids
    // which are duplicates of other centroids. We remove duplicate
    // centroids at this point since they simply take up space in the
    // model which could be used by viable centroid candidates.
    // We separate this check out from other post-convergence checks
    // since we allow for a variety of alternate embodiments
    // to implement various ways of handling this case; see code
    // below for other conditions which may arise.
    If CS Is Not Empty Then
      NewSeedCache = {}
      For each X in CS Do
        [vWeight,s] = ComputeEMWeights(M,X.mean)
        UpdateSeedCache (NewSeedCache,X,vWeight,s)
      Next X

      For I = 1 to Length(M) Do
        For J = I+1 to Length(M) Do
          If Cluster M[I] is indistinguishable from Cluster M[J] Then
            Candidate = GetBestCandidate(NewSeedCache)
            M[J].Mean = Candidate.Sum
            M[J].Covariance = Candidate.Covariance
            M[J].Weight = Candidate.Weight
            Candidate.Remove
            RestartModel = True
          End If
        Next J
      Next I
    End If

    If RestartModel Then
      Goto DoModelAgain
    End If
  Next ModelIndex
End Sub
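The inner accumulation loop of SEM, in which membership weights are evaluated at each item's mean and the item's sufficient statistics are then folded into the new model, can be sketched in Python as follows. This is a minimal one-dimensional illustration under stated assumptions, not the patented implementation: `em_weights`, `em_pass` and `model_diff` are hypothetical stand-ins for ComputeEMWeights, UpdateModel and ModelDiff, every item (a singleton or a compressed sub-cluster) is represented as a (count, sum, sum of squares) triple, and the model difference is taken to be the summed absolute change in the means.

```python
import math

def em_weights(model, x):
    """Return (membership weights, summed weighted heights) for scalar x."""
    heights = [w * math.exp(-(x - m) ** 2 / (2.0 * v)) / math.sqrt(2.0 * math.pi * v)
               for (w, m, v) in model]
    s = sum(heights)
    if s == 0.0:  # point is numerically far from every cluster
        return [1.0 / len(model)] * len(model), s
    return [h / s for h in heights], s

def em_pass(model, items):
    """One pass over items, each given as (count, total, total_sq)."""
    acc = [[0.0, 0.0, 0.0] for _ in model]
    for n, total, total_sq in items:
        weights, _ = em_weights(model, total / n)  # evaluate at the item mean
        for j, wj in enumerate(weights):
            acc[j][0] += wj * n
            acc[j][1] += wj * total
            acc[j][2] += wj * total_sq
    n_all = sum(a[0] for a in acc)
    new_model = []
    for (w, m, v), (nj, sj, qj) in zip(model, acc):
        if nj == 0.0:
            new_model.append((w, m, v))  # empty cluster: left unchanged in this sketch
            continue
        mean = sj / nj
        var = max(qj / nj - mean * mean, 1e-6)  # floor keeps the variance positive
        new_model.append((nj / n_all, mean, var))
    return new_model

def model_diff(a, b):
    """Assumed convergence measure: summed absolute change in cluster means."""
    return sum(abs(ma - mb) for (_, ma, _), (_, mb, _) in zip(a, b))

singles = [(1, x, x * x) for x in (-0.2, 0.1, 0.0, 0.3)]   # uncompressed points
compressed = [(10, 50.0, 250.5)]                            # 10 points, mean 5.0
model = [(0.5, 1.0, 1.0), (0.5, 4.0, 1.0)]                  # (weight, mean, variance)
for _ in range(50):
    new_model = em_pass(model, singles + compressed)
    converged = model_diff(new_model, model) < 1e-9
    model = new_model
    if converged:
        break
```

Because compressed items enter the update only through their count, sum and sum of squares, the loop never needs the original records behind them.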

EmThreshold() is a scalable clustering implementation of the third embodiment of
thresholding. Note that other embodiments are described (Mahalanobis
Distance, and Worst-case perturbation), but the pseudocode is restricted to this
embodiment.

Sub EmThreshold (MODELs,DSs,RS,Confidence)
  RSDist = Array of { -1, -1, -1, +INFINITY }
  For ModelIndex = 1 To Length(MODELs) Do
    M = MODELs[ModelIndex]
    For RSIndex = 1 To Length(RS) Do
      [vWeight,S] = ComputeEMWeights(M,RS[RSIndex].Mean)
      ClusterNum = GetMaxPos(vWeight)
      If S < RSDist[RSIndex].S Then
        RSDist[RSIndex].RSIndex = RSIndex
        RSDist[RSIndex].ModelIndex = ModelIndex
        RSDist[RSIndex].ClusterNum = ClusterNum
        RSDist[RSIndex].S = S
      End If
    Next RSIndex
  Next ModelIndex

  RSDist = Sort(RSDist, Ascending, By S)

  For I = 1 To Floor(Confidence*Length(RS)) Do
    RSIndex = RSDist[I].RSIndex
    ModelIndex = RSDist[I].ModelIndex
    ClusterNum = RSDist[I].ClusterNum
    DS = DSs[ModelIndex]
    DS[ClusterNum].Sum += RS[RSIndex].Mean
    DS[ClusterNum].SumSq += RS[RSIndex].Mean*RS[RSIndex].Mean
    DS[ClusterNum].Weight ++
  Next I
End Sub
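A Python rendering of the same control flow may help in following the pseudocode. It is a sketch under assumptions: `em_threshold` mirrors the loop structure above, but the scoring routine is passed in as a parameter because this listing does not define how ComputeEMWeights derives S. The `toy_weights` function is a purely hypothetical stand-in (it uses the squared distance to the nearest centroid as S) so that the example runs, and points are plain lists rather than records with a Mean field.

```python
import math

def em_threshold(models, dss, rs, confidence, compute_em_weights):
    """Fold the lowest-S fraction of RS into per-cluster sufficient statistics.

    compute_em_weights(model, x) must return (weights, s).
    Returns the indices of the RS points that were summarized.
    """
    best = []
    for rs_index, x in enumerate(rs):
        record = (math.inf, -1, -1, rs_index)  # (s, model_index, cluster_num, rs_index)
        for model_index, model in enumerate(models):
            weights, s = compute_em_weights(model, x)
            cluster_num = max(range(len(weights)), key=weights.__getitem__)
            if s < record[0]:
                record = (s, model_index, cluster_num, rs_index)
        best.append(record)
    best.sort(key=lambda r: r[0])  # ascending by S
    n_compress = math.floor(confidence * len(rs))
    for _, model_index, cluster_num, rs_index in best[:n_compress]:
        x = rs[rs_index]
        stats = dss[model_index][cluster_num]
        stats["sum"] = [a + b for a, b in zip(stats["sum"], x)]
        stats["sumsq"] = [a + b * b for a, b in zip(stats["sumsq"], x)]
        stats["weight"] += 1
    return [r[3] for r in best[:n_compress]]

def toy_weights(model, x):
    """Hypothetical scorer: model is a list of centroids, S is the nearest squared distance."""
    d2 = [sum((a - b) ** 2 for a, b in zip(x, c)) for c in model]
    h = [math.exp(-0.5 * d) for d in d2]
    total = sum(h)
    return [v / total for v in h], min(d2)

models = [[[0.0, 0.0], [4.0, 4.0]]]
dss = [[{"sum": [0.0, 0.0], "sumsq": [0.0, 0.0], "weight": 0} for _ in range(2)]]
rs = [[0.1, 0.0], [4.0, 3.9], [2.0, 2.0], [0.0, 0.2]]
compressed = em_threshold(models, dss, rs, 0.5, toy_weights)
```

With a confidence of 0.5 and four retained points, two points are summarized, one into each cluster's discard set.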



Computer System 
With reference to FIG. 1 an exemplary data processing 
system for practicing the disclosed data mining engine 
invention includes a general purpose computing device in 
the form of a conventional computer 20, including one or 
more processing units 21, a system memory 22, and a system 
bus 23 that couples various system components including 
the system memory to the processing unit 21. The system 
bus 23 may be any of several types of bus structures 
including a memory bus or memory controller, a peripheral 
bus, and a local bus using any of a variety of bus architectures.

The system memory includes read only memory (ROM)
24 and random access memory (RAM) 25. A basic input/output
system 26 (BIOS), containing the basic routines that
help to transfer information between elements within the
computer 20, such as during start-up, is stored in ROM 24.

The computer 20 further includes a hard disk drive 27 for
reading from and writing to a hard disk, not shown, a
magnetic disk drive 28 for reading from or writing to a
removable magnetic disk 29, and an optical disk drive 30 for
reading from or writing to a removable optical disk 31 such
as a CD ROM or other optical media. The hard disk drive 27,
magnetic disk drive 28, and optical disk drive 30 are
connected to the system bus 23 by a hard disk drive interface
32, a magnetic disk drive interface 33, and an optical drive
interface 34, respectively. The drives and their associated
computer-readable media provide nonvolatile storage of
computer readable instructions, data structures, program
modules and other data for the computer 20. Although the
exemplary environment described herein employs a hard
disk, a removable magnetic disk 29 and a removable optical
disk 31, it should be appreciated by those skilled in the art
that other types of computer readable media which can store
data that is accessible by a computer, such as magnetic
cassettes, flash memory cards, digital video disks, Bernoulli
cartridges, random access memories (RAMs), read only
memories (ROM), and the like, may also be used in the
exemplary operating environment.

A number of program modules may be stored on the hard
disk, magnetic disk 29, optical disk 31, ROM 24 or RAM 25,
including an operating system 35, one or more application
programs 36, other program modules 37, and program data
38. A user may enter commands and information into the
computer 20 through input devices such as a keyboard 40
and pointing device 42. Other input devices (not shown)
may include a microphone, joystick, game pad, satellite
dish, scanner, or the like. These and other input devices are
often connected to the processing unit 21 through a serial
port interface 46 that is coupled to the system bus, but may
be connected by other interfaces, such as a parallel port,
game port or a universal serial bus (USB). A monitor 47 or
other type of display device is also connected to the system
bus 23 via an interface, such as a video adapter 48. In
addition to the monitor, personal computers typically
include other peripheral output devices (not shown), such as
speakers and printers.

The computer 20 may operate in a networked environment
using logical connections to one or more remote
computers such as a remote computer 49. The remote
computer 49 may be another personal computer, a server, a
router, a network PC, a peer device or other common
network node, and typically includes many or all of the
elements described above relative to the computer 20,
although only a memory storage device 50 has been illustrated
in FIG. 1. The logical connections depicted in FIG. 1
include a local area network (LAN) 51 and a wide area
network (WAN) 52. Such networking environments are
commonplace in offices, enterprise-wide computer
networks, intranets and the Internet.

When used in a LAN networking environment, the computer
20 is connected to the local network 51 through a
network interface or adapter 53. When used in a WAN
networking environment, the computer 20 typically includes
a modem 54 or other means for establishing communications
over the wide area network 52, such as the Internet.
The modem 54, which may be internal or external, is
connected to the system bus 23 via the serial port interface
46. In a networked environment, program modules depicted
relative to the computer 20, or portions thereof, may be
stored in the remote memory storage device. It will be
appreciated that the network connections shown are exemplary
and other means of establishing a communications link
between the computers may be used.

While the present invention has been described with a
degree of particularity, it is the intent that the invention
include all modifications and alterations from the disclosed
implementations falling within the spirit or scope of the
appended claims.

Appendix A

The following development of the Bonferroni inequality,
used to determine the multidimensional confidence interval
on the mean, follows from page 12 of [2] G. A. F. Seber,
Multivariate Observations, John Wiley & Sons, New York,
1984.

A conservative procedure for determining the multidimensional
confidence interval on the mean vector of a set of
multivariate observations is always available using the
Bonferroni inequality:

\[ P\Big(\bigcap_{j=1}^{r} E_j\Big) \ge 1 - \sum_{j=1}^{r} P(\bar{E}_j), \]

where $\bar{E}_j$ is the complement of $E_j$. If we use the critical level
of $\alpha/r$ for each test, then

\[ P(E_j) = 1 - \alpha/r \;\Rightarrow\; P(\bar{E}_j) = \alpha/r \]

and

\[ P\Big(\bigcap_{j=1}^{r} E_j\Big) \ge 1 - \sum_{j=1}^{r} \alpha/r = 1 - r(\alpha/r) = 1 - \alpha. \]

Hence, in our application, let $E_j$ be the event that the j-th
element of the l-th current mean lies between the
values $L_j^l$ (lower bound) and $U_j^l$ (upper bound), or
specifically, $L_j^l \le x_j^l \le U_j^l$. Here $x_j^l$ is the j-th element of the
l-th current mean. The values of $L_j^l$ and $U_j^l$ define the
$100(1-\alpha/r)\%$ confidence interval on $x_j^l$, which is computed
as:

\[ L_j^l = x_j^l - t_{\alpha/2r,\,N-1}\sqrt{S_j^l/N}, \qquad U_j^l = x_j^l + t_{\alpha/2r,\,N-1}\sqrt{S_j^l/N}. \]

N is the number of singleton data points represented by
cluster l, including those that have already been compressed
in earlier iterations and uncompressed data points. $S_j^l$ is an
estimate of the variance of the l-th cluster along dimension
j. Let $L^l, U^l \in R^n$ be the vectors of lower and upper bounds on
the mean of cluster l.

The invention assigns data points to Gaussians in a
probabilistic fashion. Two different techniques are proposed
for determining the integer N, the number of singleton data
points over which the Gaussian mean is computed. The first
way is motivated by the EM Gaussian center update formula,
which is computed over all of the data processed so far
(whether it has been compressed or not); hence in the first
variant of the Bonferroni CI computation we take N to be the
number of data elements processed by the Scalable EM
algorithm so far. The second variant is motivated by the fact
that [text illegible in source] probabilistically to a
[illegible] Gaussian in the mixture model; hence in the second
variant of the Bonferroni computations we take N to be the
rounded integer of the sum of the probabilistic assignments
over all data points processed so far.

The Bonferroni CI formulation assumes that the Gaussian
centers, computed over multiple data samples of size N, are
computed as

\[ \frac{1}{N}\sum_{i=1}^{N} x^i. \]

This is true for the classic K-means algorithm, but is only
guaranteed to be true for the first iteration of the EM
algorithm. Hence a distribution other than the t distribution
may better fit the assumptions on the distribution of the
Gaussian center as computed by the EM algorithm. This
would result in a different formula for the Bonferroni CI.

After determining the confidence intervals on the K
Gaussian means, $L^l, U^l \in R^n$, $l = 1, \ldots, k$, one technique perturbs
the means so that the resulting situation is a "worst case
scenario" for a given singleton data element. Assuming that
the data point is $x^i$, we propose solving the following
optimization problem for determining the perturbed cluster
means and corresponding probabilistic assignment of data
point $x^i$ to the K perturbed Gaussians:

\[ \min_{x^1,\ldots,x^k} \Big\{ \sum_{l=1}^{k} f(x^l)\log f(x^l) \;\Big|\; \sum_{l=1}^{k} f(x^l) = 1,\; L^l \le x^l \le U^l,\; l = 1,\ldots,k \Big\}. \]

Here $f(x^l)$ is the probabilistic assignment of data point $x^i$ to
the Gaussian centered at $x^l$, more specifically:

[equation not legible in the source]

The perturbation becomes a more general optimization
problem, and the procedure used in the K-means case is a
special case of the solution of this problem when 0/1
assignments are made between points and clusters.
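As a numerical illustration of the Bonferroni confidence interval on a cluster mean discussed in this appendix, the following Python sketch computes per-dimension bounds. It rests on assumptions that should be read as such: the interval is taken to have the usual form of mean plus or minus a critical value times the standard error, the number of simultaneous tests r is taken to be the number of dimensions, and a standard normal quantile is used as a large-N stand-in for the t critical value because the Python standard library has no t quantile. The name `bonferroni_interval` is hypothetical.

```python
import math
from statistics import NormalDist

def bonferroni_interval(mean, var, n_points, alpha=0.05):
    """Return (lower, upper) bound vectors for a cluster mean.

    mean, var : per-dimension mean and variance estimates of the cluster
    n_points  : N, the number of points the mean is computed over
    alpha     : overall significance level, split across the r dimensions
    """
    r = len(mean)
    # Assumption: normal quantile approximates t_{alpha/(2r), N-1} for large N.
    q = NormalDist().inv_cdf(1.0 - alpha / (2.0 * r))
    half = [q * math.sqrt(v / n_points) for v in var]
    lower = [m - h for m, h in zip(mean, half)]
    upper = [m + h for m, h in zip(mean, half)]
    return lower, upper

lo100, hi100 = bonferroni_interval([1.0, -2.0], [4.0, 1.0], n_points=100)
lo400, hi400 = bonferroni_interval([1.0, -2.0], [4.0, 1.0], n_points=400)
```

Quadrupling N halves the interval width, which is why the choice between the two definitions of N described above affects how tightly the cluster means are bounded. For small N, a t quantile from a statistics package would be the more faithful choice.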

We claim: 

1. In a computer system, a method for clustering of data 
records into data clusters comprising the steps of: 

a) initializing a clustering model for a plurality of data 
clusters by initializing a data distribution for each 
cluster of said plurality K of data clusters; 

b) obtaining a data portion made up of data records from
a database stored on a storage medium;

c) assigning data records from the data portion obtained
from the database to the clustering model by scaling the
data records with a weighting factor based upon the
data distribution for each of the K clusters, and modifying
the data distribution for the plurality of the data
clusters in the cluster model by combining at least some
of the data records from the data portion that have been
scaled by a weighting factor with one or more clusters
of the clustering model;

d) summarizing at least some of the data records contained
within the data portion obtained from the database
based upon a compression criteria to produce
sufficient statistics for data records satisfying the
compression criteria; and

e) continuing to obtain portions of data from the database
and updating the clustering model based on newly
sampled data records from the database and the sufficient
statistics based on the contents of data records that
were previously summarized until a specified clustering
criteria has been satisfied.

2. The method of claim 1 wherein data portions obtained 
from the database are stored in a rapid access memory 
smaller in size than the storage requirements of the database 
and wherein the sufficient statistics replace data from the
database in the rapid access memory to allow further data 
from the database to be obtained from the database. 

3. The method of claim 1 wherein K weighting factors are 
determined for each data record based upon a relation of the 
data record to each cluster's data distribution.

4. The method of claim 1 wherein the data distribution for
each of the K clusters is based on a co-variance matrix and 
a mean which are used in determining said weighting factor. 

5. The method of claim 1 wherein the sufficient statistics 
are derived from compression of data from individual data
records that satisfy a compression criteria based on the data 
distribution of a given cluster, of a model from a set of 
clusters or of a set of models each of which includes a set of 
clusters. 

6. The method of claim 1 wherein the compression criteria
selects data records stored in memory for summarization
based on a weighting of each data record in relation to a data
distribution implied by a cluster or a set of clusters by the
steps of determining which of the multiple clusters a data
record belongs to with the greatest weight and summarizing
a percentage of the data records that are determined to
belong to a cluster in this fashion.



7. The method of claim 1 wherein the data distribution for 
a cluster is based on a mean and co-variance matrix for the 
cluster and wherein the compression criteria identifies data 
records that most accurately fit a model of the cluster for 
compression into sufficient statistics. 

8. The method of claim 1 wherein the sufficient statistics 
includes data organized as sub-clusters that do not satisfy the 
compression criteria based upon a data distribution for any 
of the K data clusters but wherein data records from the 
database for a given sub-cluster can be grouped together 
based upon a denseness criteria with respect to other data 
records obtained from the database. 

9. The method of claim 1 wherein the sufficient statistics 
are additionally derived from creation of data sub-clusters 
that do not satisfy the compression criteria based upon a data 
distribution for any of the K data clusters but wherein data 
records from a given sub-cluster can be grouped together 
based upon a denseness criteria with respect to other data 
records obtained from the database. 

10. The method of claim 1 wherein multiple clustering 
models are simultaneously generated in one or less scans of 
the data obtained from the database and wherein sufficient 
statistics for some of the data obtained from the database is 
shared in determining the multiple clustering models. 

11. The method of claim 1 additionally comprising the 
step of storing a data clustering model in a rapid access 
portion of a computer memory wherein the clustering model 
includes a number of cluster model summaries and further
wherein a model summary for a given cluster comprises a 
summation represented as a set of sufficient statistics from 
multiple data records that are gathered from the database. 

12. The method of claim 11 additionally comprising the 
step of maintaining an indication of cluster spread for each 
dimension of the plurality of data clusters. 

13. The method of claim 12 wherein the indication of 
spread comprises entries of a covariance matrix for the 
cluster. 

14. The method of claim 11 wherein data records are 
compressed into a set of data based upon a relation of the 
data records to a clustering model and further includes data 
sub-clusters derived from data records that do not satisfy a 
compression criteria but which are summarized and wherein 
the clustering model includes a contribution of the com- 
pressed data records and the data records summarized within 
sub-clusters based on the data distribution of the clustering
model for a given cluster. 

15. The method of claim 1 wherein before the stopping 
criteria is reached a step is performed of choosing a new 
cluster characterization from a list of candidate clusters to 
replace one or more of the existing cluster characterization.

16. The method of claim 15 wherein the step of choosing 
a new cluster characterization is based upon the weighted 
membership of the data records in a given cluster and 
wherein a new characterization is chosen if said weighted 
membership is less than a threshold. 

17. The method of claim 15 wherein the step of choosing 
a new cluster is performed by replacing a cluster character- 
ization with sufficient statistics from a data subcluster from 
a storage region that maintains a list of such subclusters. 

18. The method of claim 17 wherein the data subcluster 
that is chosen to replace a given subcluster is determined by 
summing heights of data attributes of the subcluster on 
clusters other than the cluster to be replaced and the sub- 
cluster having the lowest sum is the candidate subcluster that 
is chosen to replace the cluster. 

19. The method of claim 18 wherein the cluster is replaced
by sufficient statistics of a single data record in a list of
individual data records and wherein the single data record is
chosen based on a low sum of heights in data clusters other
than the cluster to be replaced.

20. The method of claim 1 wherein the database is
scanned completely and then additionally clustering is
performed by rescanning the data in the database.

21. The method of claim 1 wherein the scanning of the 
database is random, indexed or sequential. 

22. The method of claim 21 wherein the clustering is 
performed in less than one complete scan of the database
based upon a stopping criteria. 

23. The method of claim 1 additionally comprising the 
step of displaying parameters of a clustering process on a 
user interface and wherein the interface allows the user to 
stop and resume the clustering process.

24. The method of claim 23 additionally comprising the 
step of updating the database before clustering is again 
started. 

25. The method of claim 23 wherein the user interface
includes controls for allowing the user to evaluate the
clustering process, suspend the clustering process, adjust the
number of clusters into which the data is clustered and then
restart the clustering process.

26. The method of claim 23 wherein the user interface
includes controls for allowing the user to save a clustering
model to another storage device and then restart the
clustering from the saved clustering model.

27. In a computer data mining system, apparatus for 
evaluating data in a database comprising: 

a) one or more data storage devices for storing a database 
of records on a storage medium; 

b) a computer having a rapid access memory for storing 
data and including an interface to the storage devices 
for reading data from the storage medium and bringing 
the data into the rapid access memory for subsequent 
evaluation; and 

c) said computer comprising a processing unit for clus- 
tering of at least some of the data records in the 
database and for characterizing the data into a cluster- 
ing model having a multiple number K of data clusters 
wherein each data cluster is characterized by a data 
distribution; said processing unit programmed to 
retrieve data records from the database into the rapid 
access memory, evaluate a contribution of each record 
to a multiple number of data clusters based upon an 
existing clustering model by scaling each record with a 
scaling factor based on the data distribution for a data 
cluster, combine at least some of the data records with
one or more of the K data clusters based upon said 
contribution, and then summarize sufficient statistics 
for at least some of the data records before retrieving 
additional data records from the database.

28. The apparatus of claim 27 wherein the computer 
includes a rapid access memory for storing a cluster model 
for each cluster, and further wherein each time data is 
gathered from the database the clustering model is updated
and then some of the data obtained from the database is
summarized into a discard set data structure associated with
a given cluster based upon an updated data distribution of
the multiple clusters in the clustering model and further
wherein the computer forms sub-clusters of data that is
summarized together for use in defining the clustering
model.

29. The apparatus of claim 27 wherein the processor 
selects data to be compressed based on a weighting of the 
data point in relation to a data distribution for a cluster by the 
steps of determining which of the plurality of clusters a data 
point belongs to with the greatest weight and summarizing 
a percentage of the data points that are determined to belong 
to a cluster in this fashion as sufficient statistics to
characterize a cluster that belong to with said greatest weight.

30. The apparatus of claim 27 wherein the data distribu- 
tion for a given cluster is defined by a mean and a covariance 
matrix. 

31. The apparatus of claim 30 wherein the covariance 
matrix has only diagonal entries. 

32. The apparatus of claim 30 wherein the computer 
updates multiple clustering models and includes multiple
processors for updating said multiple clustering models. 

33. The apparatus of claim 27 wherein the processor unit 
processes a computer program that is part of a database data 
mining engine separate from an application program that
requests the clustering data of a model created by the 
processor unit. 

34. In a computer data mining system, a method for 
evaluating data records in a database that is stored on a 
storage medium comprising the steps of: 

a) initializing multiple storage areas for storing multiple 
cluster models of the data records in the database 
wherein each model is made up of multiple clusters; 

b) obtaining a portion of the data records in the database 
from a storage medium; 

c) using a data distribution model for clusters of the 
database to classify data records obtained from the 
database and combining at least some of the data 
records from the data portion to form updated multiple
cluster models; 

d) summarizing data records contained within the portion 
of data based upon a compression criteria by comparing 
each data record to cluster summaries that make up the 
multiple cluster models to produce sufficient statistics 
associated with one of the multiple cluster models from 
each of the data records satisfying the compression 
criteria; and 

e) continuing to obtain portions of data from the database 
and characterizing the clustering of data in the database 
in the clustering models based on newly sampled data 
and the sufficient statistics for each of the multiple
cluster models until a specified criteria has been
satisfied.